Abstract

We consider the problem of dynamically scheduling J jobs on N processors for non-preemptive 
execution where the value of each job (or the reward garnered upon completion) decays over time. All 
jobs are initially available in a buffer and the distribution of their service times are known. When a 
processor becomes available, one must determine which free job to schedule so as to maximize the total 
expected reward accrued for the completion of all jobs. Such problems arise in diverse application areas, 
e.g. scheduling of patients for medical procedures, supply chains of perishable goods, packet scheduling 
for delay-sensitive communication network traffic, etc. Computation of optimal schedules is generally 
intractable, while online low-complexity schedules are often essential in practice. 

It is shown that the simple greedy/myopic schedule provably achieves performance within a factor
$2 + \mathbb{E}[\max_j \sigma_j] / \min_j \mathbb{E}[\sigma_j]$ of optimal. This bound can be improved to a factor of 2 when the service times are
identically distributed. Various aspects of the greedy schedule are examined and it is demonstrated to 
perform quite close to optimal in some practical situations despite the fact that it ignores reward-decay 
deeper in time. 



1 Introduction 

Consider a queueing/scheduling system (as in Fig. 1), where a finite number J of jobs wait in a buffer, each to
be processed by one of N servers/processors. Time is slotted. The service/processing requirement, $\sigma_j$, of
each job j is random and its distribution, $f_j(\sigma_j)$, is known. All processors operate at service rate 1; hence,
the service time for each job is invariant to the processor to which it is assigned. Service is non-preemptive (job
service cannot be interrupted mid-processing to be resumed later or discontinued). The completion of job j 
in time slot t garners a reward Wj(t) > 0, which decays with time (i.e. Wj(t) is non-increasing in t). The 
goal is to schedule the jobs on the processors so as to maximize the aggregate reward accrued when all jobs 
complete execution. 

As will become clear below, a key complicating factor is that the job service is non-preemptive, inducing
a 'combinatorial twist' on the problem. Under preemptive processing, the latter would wash away and the
problem would become much simpler. Another complicating factor is the fact that the rewards/values $w_j(t)$
decay over time in a general way; special cases might be significantly easier to handle (though still not
necessarily easy). A third complicating factor is the general distributions of the stochastic job processing
times $\sigma_j$ (even though these are independent across different jobs); for special distributions the problem
can become significantly simpler (and the results tighter). We aim to address the problem in the most
general setting arising in a variety of applications (see below), which may actually require online (real-time)
schedule implementation. In that case, since the complexity of computing the optimal job schedule
is prohibitive, one seeks simple and practical schedules (implementable online), which have performance
within provable bounds from optimal. In this paper, we focus on a greedy/myopic schedule defined below
and study its efficiency. We discuss these factors below in conjunction with prior work and a variety of
applications.

Figure 1: System Diagram: J jobs wait to be processed on one of N machines. The processing time of each
job is independent of the processor and other jobs.

1.1 Applications 

There are diverse applications where job completion rewards decay over time. For example, such is the 
case with patient scheduling in health-care systems. Delays in treatment often lead to deterioration of 
patient health (see, for instance, [1]) which may result in reduction of the eventual treatment impact; this 
is obviously the case with various medical procedures, operations, etc. Indeed, a number of studies have 
demonstrated that delayed treatment results in increased patient mortality [2-6]. Moreover, in a related study 
[7], over 60% of physicians reported dissatisfaction with delays in viewing test results, which subsequently 
led to delays in treatment. It is likely that increased mortality is primarily induced via deterioration of 
patient health condition and resulting reduction of benefit from eventual treatment. This is how the effect of 
treatment delay is modeled in this paper.

On the other hand, in information technology, reward decay occurs in various situations, for example in
multimedia packet scheduling for transmission over wireless links. Each packet corresponds to a job which 
is completed once the packet is successfully received at the receiver; until then, it is repeatedly transmitted 
(non-preemptive processing). Transmission time until successful reception is random, due both to random 
packet sizes and randomly varying wireless channel quality. In the simplest case, video packets have a 
single deadline and reward is only received if the packet is received prior to its deadline expiration. In 
more advanced schemes, multiple deadlines are considered (decreasing, piecewise-constant reward decay 
function), reflecting coding interdependencies across packets. Indeed, even if a packet misses its initial 
deadline, it could improve the quality of the received and reconstructed video because other packets which 
depend on it may still be able to meet their deadline [8]. 

As with multimedia packet scheduling above and similar situations of task scheduling in parallel com- 
puting systems, we can consider jobs that contain interdependencies within our model. The completion of a 
single job j garners reward rj. However, other jobs may rely on that one too, either because they cannot be- 
gin processing until that is completed (due to data-passing, precedence constraints, etc.) or their processing 
accuracy/quality depends on output from that job (e.g. decoding dependencies). Therefore, the 'effective' 
reward generated is actually $w_j(t) = r_j - f(t)$, where the increasing function $f(t)$ reflects the detrimental
effect that completing job j after delay t has on other jobs depending on it. In fact, our formulation allows 
for the case where even rj is a decaying function in time. 

A third application area where job completion rewards may decay over time is in the case of perishable 
items, like food, medicine, etc. For example, the quality of food items (milk, eggs, etc.) decays with time. 
The scheduling problem is when to release these items for sale given varying transportation times (from 
storage to shelf) and the decaying reward $R(t)$. It is also possible to have a cost s for each time slot the item
remains in storage so that the effective reward of an item once it is released for sale is $C(t) = R(t) - st$.

1.2 Literature Review 

When rewards do not decay over time but stay constant, job scheduling problems may be cast in the frame- 
work of 'multiarmed bandit' problems [9, 10]. Furthermore, optimal policies for certain 'well-behaved' 
decaying reward functions (such as linear and exponential) have been developed (see [9, 10] and related 
works). Unfortunately, under general decaying rewards, solving for the optimal schedule becomes very 
difficult.

There has been related work on delay-sensitive scheduling in networking. In the case of broadcast 
scheduling in computer networks, jobs correspond to requests for pages (files). Due to the broadcast nature 
of a wireless channel, multiple requests can be satisfied with the transmission of a single page. In [11], a 
greedy algorithm is shown to be a 2-approximation for throughput maximization of broadcast scheduling in 
the case of equal-sized pages. In a similar scenario, an online preemptive algorithm is shown to be $O(\sqrt{n})$-competitive,
where n is the number of pages that can be requested [12]. Our work differs from this prior
work in that 1) we allow for arbitrary decaying rewards, rather than restricting to step functions that drop when the deadline
expires, and 2) jobs are non-preemptive and have varied lengths (and all jobs are available at time 0).

A substantial body of work has focused on scheduling for perishable products (see [13] for a review). 
The focus is on finding an optimal ordering policy given the lifetime and demand of the perishable items. 
In [14], the authors study how to maximize utility garnered by delivering perishable goods, such as ready- 
mixed concrete, and minimize costs subject to stochasticity in transportation times. The authors formulate 
a mathematical program to solve the problem and propose heuristic algorithms for use in practice. Interest- 
ingly, the perishable items in this case have a fixed lifetime, after which they are rendered useless (deadline). 
Our formulation here allows for general decay. 

In [15], the authors look at how to schedule an M/M/1 queue where rewards decay exponentially, dependent
on each job's sojourn time, due to the 'impatient' nature of the users. A greedy policy is shown to be optimal
in the case of identical decay rates of these impatient users. Our scheduling problem is closely related
to a number of instances of the Multiarmed Bandit Problem. When rewards exhibit 'well-behaved' decay, 
(identical rates, constant rates, etc.) it is possible to find optimal, or near-optimal algorithms [9, 10, 16, 17]. 
This is not always the case for arbitrary decay. 

In a problem similar to the one we study in this paper, a greedy algorithm is shown to be a 2-approximation 
when job completions generate rewards according to general decaying reward functions [18]. The main dis- 
tinction between this work and ours is that the previous work allows for job preemption while we consider 
the case that once a job is scheduled it occupies the machine until it completes. This constraint adds an extra 
layer of complexity. 

Indeed, non-preemption makes the scheduling problem we study substantially more difficult. Non- 
preemptive interval scheduling is studied in [12, 19] among others. Jobs can either be scheduled during 
their specified interval or rejected. The end of the interval corresponds to the deadline of the corresponding 
job. If $\Delta$ is the ratio of the largest job size to the smallest job size, then an online algorithm cannot achieve a competitive ratio better
than $O(\log \Delta)$. Our work differs from this prior work because we consider arbitrary decay of rewards and
assume all jobs are available at time 0. The decaying reward functions make this a more general and difficult
scheduling problem. However, our result also relies on $\Delta$, the ratio between largest and smallest jobs.

Still, there are instances where optimal schedules can be found for arbitrary decaying rewards. In a 
parallel scenario to ours, jobs can be scheduled, non-preemptively, multiple times. For this problem, the 
reward function for completing a particular job decays with the number of times that job has been completed. 
In this case, a greedy policy is optimal for arbitrary decaying rewards [10]. This problem is parallel to ours 
in that it allows for arbitrary decaying rewards. However, the decay does not depend on the completion time 
of the job, but rather on the number of times that job has been completed. In our case, each job is only 
processed a single time. 

Relating back to our scenario where the rewards decay with time, it is again the case that for 'well- 
behaved' decaying functions (linear and exponential), policies based on an index rule are optimal [9, 10]. 
The policy we propose in this paper is also an index rule. In fact, the proposed policy is very closely related
to the 'c-μ'-type scheduling rules (see, for instance, [10, 20]) where the objective is to minimize cost (rather
than maximize rewards) when costs are linearly or concavely increasing. One of the main distinctions
between our work and this is that we consider multiple servers. Unfortunately, the optimality of the 'c-μ'
rule does not extend to this case. Furthermore, linear/concave decaying rewards are just single instances
of our more general formulation of decaying rewards. It is also important to recognize that many of the 
results of this prior work are in heavy-traffic regimes where a lot of the fine-grained optimization required 
in non-heavy-traffic is washed out. 

1.3 Summary of Results 

In this paper, we study the efficacy of a greedy scheduling algorithm for non-preemptive jobs whose re- 
wards decay arbitrarily with time. There are a number of applications which exhibit such behavior such as 
patient scheduling in hospitals, packet scheduling in multimedia communication systems, and supply chain 
management for perishable goods. It is shown that finding an optimal scheduling policy for such systems is 
NP-hard. As such, finding simple heuristics is highly desirable. We show that a greedy algorithm is guaranteed
to be within a factor of $\Delta + 2$ of optimal, where $\Delta$ is the ratio of the largest job completion time to
the smallest. This bound is improved in some special cases. Via numerical studies, we see that, in practice,
the greedy policy is likely to perform much closer to optimal, which suggests it is a reasonable heuristic for
practical deployment. To the best of our knowledge this is the first look at non-preemptive scheduling of
jobs with arbitrary decaying rewards.

The rest of the paper is structured as follows. In Section 2 we formally introduce the scheduling model
we will study. In Section 3 we propose and study the performance of a greedy scheduling policy. The main
result, which is a bound on the loss of efficiency due to greedy scheduling, is given in Section 3.2. In Section
4 we examine some special cases where this bound can be improved. In Section 5 we evaluate the performance
of the greedy policy via a simulation study. Finally, we conclude in Section 6.

2 Model Formulation 

Consider a set of J jobs, indexed by $j \in \mathcal{J} = \{1, 2, \ldots, J\}$, and N processors/servers, indexed by
$n \in \mathcal{N} = \{1, 2, \ldots, N\}$. Each job $j \in \mathcal{J}$ has a random processing requirement $\sigma_j$ and can be processed by any
processor $n \in \mathcal{N}$. All processors have service rate 1 and each one can process a single job at a time. Service
is non-preemptive in the sense that once a processor starts executing a job it cannot stop until completion.
Time is slotted and indexed by $t \in \{1, 2, 3, \ldots\}$. We denote the distribution of the service times by $f_j(\sigma_j)$.

Assumption 1 The random job processing times $\sigma_j$, $j \in \mathcal{J}$, are 1) statistically independent with
$P(\sigma_j < \infty) = 1$ and 2) their distributions, $f_j(\sigma_j)$, do not depend on time.

However, the job processing times are not necessarily identically distributed.

Let $b_j(t)$ be the residual service time of job j at time t. Initially, $b_j(0) = \sigma_j$ for each $j \in \mathcal{J}$. The
backlog state of the system at time t is the vector

$$b(t) = (b_1(t), b_2(t), \ldots, b_j(t), \ldots, b_J(t)). \tag{1}$$

It evolves from initial state $b(0) = (\sigma_1, \sigma_2, \ldots, \sigma_j, \ldots, \sigma_J)$ to final state $b(T) = (0, 0, \ldots, 0, \ldots, 0)$ by assigning
processors to process the jobs non-preemptively, until all jobs have finished execution at some (random)
time T. Note that for each job $j \in \mathcal{J}$, $b_j(t) = \sigma_j$ implies that j has not started processing by t (has not been
scheduled before t), while $b_j(t) = 0$ implies that the job finished execution before (or at) time t. Indeed, if
job j starts execution at time slot $t_j$ and finishes at the beginning of time slot $T_j$, then $\sigma_j = T_j - t_j$ and

$$b_j(t) = \begin{cases} \sigma_j, & t \le t_j \\ T_j - t, & t_j < t < T_j \\ 0, & t \ge T_j \end{cases} \tag{2}$$

As discussed later, the start times $t_j$ are chosen by the scheduling policy, while the end times $T_j$ are then
determined by the fact that scheduling is non-preemptive so that $T_j = t_j + \sigma_j$.

The job service times are random and their true values are not observable ex ante or known a priori; they
can only be seen ex post, after a job has completed processing. However, the values $x_j(t)$ tracking which
jobs are completed at each time t,

$$x_j(t) = \begin{cases} 1, & \text{if job } j \text{ has not completed processing at time slot } t \\ 0, & \text{if job } j \text{ has completed by time slot } t \end{cases} \tag{3}$$

are directly observable for each job $j \in \mathcal{J}$. We work below with the observable 'backlog state'

$$x(t) = (x_1(t), x_2(t), \ldots, x_j(t), \ldots, x_J(t)) \tag{4}$$

in $\{0, 1\}^J$, which tracks which jobs are completed and which are still waiting to complete processing at time
t.

To fully specify the state of a job, we define $y_j(t)$ as the time slot $t' < t$ in which job j begins processing.
Specifically,

$$y_j(t) = \begin{cases} t', & \text{if job } j \text{ began processing in time slot } t' < t \\ \emptyset, & \text{if job } j \text{ has not begun processing prior to time slot } t \text{ (necessarily, } x_j(t) = 1\text{)} \end{cases} \tag{5}$$

Hence, any job with $y_j(t) = \emptyset$ (where $\emptyset$ is some null symbol) has not yet begun processing and is free to
be scheduled. If $y_j(t) \neq \emptyset$ and $x_j(t) = 1$, then job j has started but not completed, and it is still being processed due to the
non-preemptive nature of the service discipline. Once a job is scheduled in time slot $t_j$, then $y_j(t) = t_j$ for all
$t > t_j$. The service state is then

$$y(t) = (y_1(t), y_2(t), \ldots, y_j(t), \ldots, y_J(t)) \tag{6}$$

in $\{\{0, 1, \ldots, t - 1\} \cup \emptyset\}^J$ and tracks when (and if) each job began processing. In time slot t, one can
calculate the distribution for the remaining service time $b_j(t)$ given the distribution of $\sigma_j$, based on when (if)
the job has started processing and whether it has completed. Only the distribution of $b_j(t)$ is known, as the
job service time is only observable once the job completes processing. Therefore, $x_j$ and $y_j$ can be jointly
leveraged to compute the distribution of the residual service time of job j.

We next define the state $z_n(t)$ of processor $n \in \mathcal{N}$, which tracks which job it is assigned to process in
time slot t. Specifically,

$$z_n(t) = \begin{cases} j, & \text{if processor } n \text{ is still executing job } j \in \mathcal{J} \text{ at the beginning of time slot } t \\ 0, & \text{if processor } n \text{ is free at the beginning of time slot } t \text{, hence available for allocation} \end{cases} \tag{7}$$

and the processor state is

$$z(t) = (z_1(t), z_2(t), \ldots, z_n(t), \ldots, z_N(t)) \tag{8}$$

in $\{0, 1, \ldots, j, \ldots, J\}^N$ and tracks the free vs. allocated processors at the beginning of time slot t.

At the beginning of each time slot t, each job j with $y_j(t) = \emptyset$ (not yet started) can be scheduled on
(matched with) a processor n with $z_n(t) = 0$ (free) to start execution. The observable state of the system at
the beginning of time slot t is

$$s_t = (x(t), y(t), z(t)). \tag{9}$$

Recall that from $x(t)$ and $y(t)$ we can determine the distribution of the remaining service time $b(t)$. So
the global state (9) yields the distribution for the remaining backlog and also tracks the processor state. The
state space $\mathcal{S}$ is the set of all states the system may attain throughout its evolution. We denote by $x(s)$ the
projection of the state onto the x-coordinate. We similarly apply notation for $y(s)$ and $z(s)$.

Given the free jobs and processors at state $s_t$, we denote by $\mathcal{A}(s_t)$ the set of job-processor matchings
(schedules) that can be selected, i.e. they are feasible, at the beginning of time slot t. These matchings are
in addition to those already in place for jobs which are in mid-processing due to the non-preemptive nature
of execution. Note that at each time t, for any feasible job-processor matching $A \in \mathcal{A}(s_t)$ we have that
$(j, n) \in A$ implies $x_j(t) = 1$, $y_j(t) = \emptyset$ and $z_n(t) = 0$, meaning processor n is free and job j has not
started processing. Also, only one free job can be matched to each free processor and vice-versa (hence,
$(j, n), (k, m) \in A$ with $(j, n) \neq (k, m)$ implies $j \neq k$ and $n \neq m$). Despite the fact that $\mathcal{A}(s_t)$ is clearly a
function of $s_t$, we may occasionally suppress $s_t$ for notational simplicity.
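
As an illustration of the feasibility conditions above, the following minimal sketch enumerates every feasible job-processor matching for a small state. It is not part of the original formulation: the function name `feasible_matchings` is hypothetical, jobs and processors are 0-indexed, and `None` stands in both for the null start-time symbol and for a free processor (the text uses 0 for a free processor, which would clash with 0-based job indices).

```python
from itertools import combinations, permutations
from typing import FrozenSet, List, Optional, Sequence, Tuple

Matching = FrozenSet[Tuple[int, int]]  # set of (job, processor) pairs


def feasible_matchings(
    x: Sequence[int],            # x[j] = 1 if job j has not completed, else 0
    y: Sequence[Optional[int]],  # y[j] = start slot of job j, or None if not started
    z: Sequence[Optional[int]],  # z[n] = job running on processor n, or None if free
) -> List[Matching]:
    """Enumerate all matchings of free jobs to free processors (assumed encoding)."""
    free_jobs = [j for j in range(len(x)) if x[j] == 1 and y[j] is None]
    free_procs = [n for n in range(len(z)) if z[n] is None]
    matchings: List[Matching] = []
    # A matching may use any number k of pairs, including none at all.
    for k in range(min(len(free_jobs), len(free_procs)) + 1):
        for jobs in combinations(free_jobs, k):
            # Each ordered choice of k distinct processors gives a distinct matching.
            for procs in permutations(free_procs, k):
                matchings.append(frozenset(zip(jobs, procs)))
    return matchings
```

With two free jobs and two free processors this yields seven matchings (one empty, four with a single pair, two with two pairs), which hints at how quickly the action set grows with J and N.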

The completion of job j by the end of time slot t garners non-negative reward Wj(t). We assume the 
reward decays over time, as follows. 

Assumption 2 For each job $j \in \mathcal{J}$, the reward function $w_j(t) \ge 0$ decays over time; that is, it is
non-increasing in t (may be piece-wise constant).

This immediately accounts for raw deadlines by setting $w_j(t) = \mathbf{1}\{t < d_j\}$, where $d_j$ is the deadline of job j.
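
To make the assumption concrete, here is a small sketch of two reward functions that satisfy it: a single hard deadline and a piecewise-constant decay with several deadlines. The helper names (`deadline_reward`, `multi_deadline_reward`, `is_valid_reward`) are illustrative and not taken from the paper, and the strict inequality at the deadline simply mirrors the indicator written above.

```python
from typing import Callable, Sequence, Tuple

Reward = Callable[[int], float]  # maps a completion slot t to a reward


def deadline_reward(value: float, deadline: int) -> Reward:
    """Reward `value` if the job completes strictly before `deadline`, else 0."""
    return lambda t: value if t < deadline else 0.0


def multi_deadline_reward(steps: Sequence[Tuple[int, float]]) -> Reward:
    """Piecewise-constant decay.

    `steps` lists (deadline, value) pairs with increasing deadlines and
    non-increasing values; the reward is 0 after the last deadline.
    """
    def w(t: int) -> float:
        for deadline, value in steps:
            if t < deadline:
                return value
        return 0.0
    return w


def is_valid_reward(w: Reward, horizon: int) -> bool:
    """Check non-negativity and monotone decay on slots 0..horizon."""
    values = [w(t) for t in range(horizon + 1)]
    non_negative = all(v >= 0 for v in values)
    non_increasing = all(a >= b for a, b in zip(values, values[1:]))
    return non_negative and non_increasing
```

A check such as `is_valid_reward` only inspects a finite horizon, so it is a sanity test rather than a proof that a given function decays for all t.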

Recall that if job j is scheduled on processor n at the beginning of time slot t, it will finish by the
beginning of time slot $t + \sigma_j$. Therefore, the reward 'locked' at the beginning of time slot t, given that a
job-processor match $A \in \mathcal{A}(s_t)$ is chosen to be used in this slot, is simply

$$R_t(s_t, A) = \sum_{(j,n) \in A} w_j(t + \sigma_j). \tag{10}$$

It is desirable to design a control (scheduling, matching) policy choosing at each t a job-processor
matching in $\mathcal{A}(s_t)$ to maximize the total expected reward accrued until all jobs have been executed. Since
at time t the realization of $\sigma_j$ is unknown for each job j that has not completed by t, any control policy is
a priori unaware of the exact reward accrued from a particular action at t. Only the statistics of this reward are
known. Specifically, let $\pi$ be a scheduling policy which chooses a job-processor matching $\pi_t(s_t) \in \mathcal{A}(s_t)$
at t, and let $\Pi$ be the set of all such policies. Define the expected total reward-to-go under a policy $\pi$ starting
at state $s \in \mathcal{S}$ at time slot t, as

$$J_t^{\pi}(s) = \mathbb{E}\left[ \sum_{t'=t}^{T} R_{t'}\big(s_{t'}, \pi_{t'}(s_{t'})\big) \,\middle|\, s_t = s \right], \tag{11}$$

where T is the (random) time at which all jobs have completed execution. T may depend on the policy $\pi$
used. Note that if we wanted to consider a finite, deterministic horizon T, we could appropriately generate a
schedule based on the modified, truncated reward functions $\tilde{w}_j(t)$, such that $\tilde{w}_j(t) = w_j(t)$ for all $t < T$ and
$\tilde{w}_j(t) = 0$ otherwise. The expectation is taken over the random service times $\sigma_j$ of the jobs. We let

$$J_t^{*}(s) = \max_{\pi \in \Pi} J_t^{\pi}(s) \tag{12}$$

denote the expected total reward-to-go under the optimal policy, $\pi^{*} = \arg\max_{\pi \in \Pi} J_t^{\pi}(s)$.

The optimal reward-to-go function (or value function) $J^{*}$ and the optimal scheduling policy $\pi^{*}$ can in
principle be computed via dynamic programming. Once all jobs have been completed, $x = 0$ and no more
reward can be earned. Therefore, $J_t^{*}(s) = 0$ for all $s = (x, y, z)$ such that $x = 0$.

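Given the current state $s_t = s$ and the matching A between free jobs and processors enabled at the
beginning of time slot t, the system will transition to state $s_{t+1} = s'$ at the beginning of time slot t + 1 with
probabilities $P_A(s_{t+1} = s' \mid s_t = s)$. For example, if the service times $\sigma_j$ are geometrically distributed with
probabilities $p_j$ correspondingly, and the system is in state $s_t = s = (x, y, z)$ and matching $A \in \mathcal{A}(s_t)$ is
chosen, then the system transitions to state $s_{t+1} = s' = (x', y', z')$ with the following probabilities:

$$
\begin{aligned}
P_A(x'_j = 0 \mid s) &= \begin{cases} 1, & \text{if } x(s)_j = 0; \\ p_j, & \text{if } x(s)_j = 1 \text{ and } \big(y(s)_j < t \text{ or } (j, n) \in A \text{ for some } n\big); \\ 0, & \text{otherwise.} \end{cases} \\
P_A(x'_j = 1 \mid s) &= \begin{cases} 1, & \text{if } x(s)_j = 1,\ y(s)_j = \emptyset \text{ and } (j, n) \notin A \text{ for all } n; \\ 1 - p_j, & \text{if } x(s)_j = 1 \text{ and } \big(y(s)_j < t \text{ or } (j, n) \in A \text{ for some } n\big); \\ 0, & \text{otherwise.} \end{cases} \\
P_A(y'_j = t \mid s) &= \begin{cases} 1, & \text{if } (j, n) \in A \text{ for some } n; \\ 0, & \text{otherwise.} \end{cases} \\
P_A(y'_j = y_j \mid s) &= \begin{cases} 1, & \text{if } (j, n) \notin A \text{ for all } n; \\ 0, & \text{otherwise.} \end{cases} \\
P_A(z'_n = j \mid s) &= \begin{cases} 1 - p_j, & \text{if } z(s)_n = j \text{ or } (j, n) \in A; \\ 0, & \text{otherwise.} \end{cases} \\
P_A(z'_n = 0 \mid s) &= \begin{cases} p_j, & \text{if } z(s)_n = j \text{ or } (j, n) \in A \text{ for some } j; \\ 1, & \text{if } z(s)_n = 0 \text{ and } (j, n) \notin A \text{ for all } j; \\ 0, & \text{otherwise.} \end{cases}
\end{aligned}
\tag{13}
$$

Here $y(s)_j < t$ means that job j started in an earlier slot, so a job with $x(s)_j = 1$ and $y(s)_j < t$ is in mid-processing.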
We can now recursively obtain $J^{*}$ using the Bellman recursion

$$
\begin{aligned}
J_t^{*}(s) &= \max_{A \in \mathcal{A}(s)} \Big\{ \mathbb{E}\Big[ \sum_{(j,n) \in A} w_j(t + \sigma_j) \Big] + \sum_{s' \in \mathcal{S}} P_A[s_{t+1} = s' \mid s_t = s] \, J_{t+1}^{*}(s') \Big\} \\
&= \max_{A \in \mathcal{A}(s)} \mathbb{E}\big[ R_t(s, A) + J_{t+1}^{*}(S(s, A)) \big]
\end{aligned}
\tag{14}
$$

where $S(s, A)$ is the random next state encountered given that we start in state s and action A is taken.
The solution can be found using the value iteration method.

Proposition 1 There exists an optimal control solution to (14) which is obtainable via value iteration.

Proof: Once the queue is emptied, Bellman's recursion terminates. When $x = 0$, there are no more jobs left
to be processed. No action can generate any reward and the optimal policy will never leave this state once it
reaches it. There exists a policy which will complete all jobs and cause the Bellman recursion to terminate
in finite time (e.g. we process all jobs on a single server, n, in random order; because $P(\sigma_j < \infty) = 1$, all
jobs will be completed in finite time). This guarantees the existence of a stationary optimal policy which is
obtainable via value iteration [21]. ■
Of course, this approach is computationally intractable: the state space (the set of all (x, y, z)) is
exponentially large, which makes such problems difficult to solve in practice.
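
For intuition only, the recursion can be solved exactly on toy instances. The sketch below is a deliberately restricted variant, not the general multi-processor recursion: it assumes a single processor that never idles (so decisions happen only at completion epochs) and service-time distributions with finite support. The names `optimal_value`, `pmfs` and `rewards` are hypothetical.

```python
from functools import lru_cache
from typing import Callable, Dict, FrozenSet, Sequence


def optimal_value(
    pmfs: Sequence[Dict[int, float]],           # pmfs[j][d] = P(service time of job j = d)
    rewards: Sequence[Callable[[int], float]],  # rewards[j](t) = reward if job j completes at t
) -> float:
    """Expected reward of an optimal non-idling schedule on ONE processor."""

    @lru_cache(maxsize=None)
    def value(remaining: FrozenSet[int], t: int) -> float:
        # No jobs left: nothing more can be earned.
        if not remaining:
            return 0.0
        best = float("-inf")
        for j in remaining:
            # Schedule job j at time t; average over its possible durations d.
            expected = sum(
                p * (rewards[j](t + d) + value(remaining - {j}, t + d))
                for d, p in pmfs[j].items()
            )
            best = max(best, expected)
        return best

    return value(frozenset(range(len(pmfs))), 0)
```

Even in this simplified setting the memo table is indexed by subsets of jobs, so its size grows exponentially in J, which is consistent with the intractability remark above.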

2.1 A Hardness Result 

We now show that a special case of the non-preemptive scheduling problem is NP-hard. Consider a deterministic
version of the problem where the completion time of job j is $\sigma_j$ with probability 1. Let $w_j(t) = v_j$
for $t \le K$ and $w_j(t) = 0$ otherwise. We can think of $v_j$ as the value of job j and K as the shared deadline
amongst all jobs. This version of the non-preemptive scheduling problem with decaying rewards is
equivalent to the 0/1 Multiple Knapsack Problem, which is known to be NP-hard.

Theorem 1 The non-preemptive scheduling problem with decaying rewards is NP-hard.

Proof: In the case of the 0/1 Multiple Knapsack Problem, there are J objects of size $\sigma_j$ and value $v_j$ to be
placed in N knapsacks of capacity K. Reward is only accrued if the entire object is placed in a knapsack;
fractional objects are not possible. The optimal packing of objects is equal to the optimal scheduling policy
for the non-preemptive scheduling with decaying rewards problem. This reduction takes constant time. This
completes the proof. ■
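
The correspondence used in the proof can be checked by brute force on tiny instances. The following sketch is illustrative only: both function names are hypothetical, sizes and values are taken to be integers, and a job is assumed to earn its value when it completes at or before the shared deadline K. Schedules are enumerated as a processing order plus an assignment of jobs to processors, with each processor running its jobs back to back.

```python
from itertools import permutations, product
from typing import Sequence


def knapsack_optimum(sizes: Sequence[int], values: Sequence[int],
                     n_knapsacks: int, capacity: int) -> int:
    """Best total value over all packings; -1 means 'item left out'."""
    best = 0
    for assign in product(range(-1, n_knapsacks), repeat=len(sizes)):
        load = [0] * n_knapsacks
        total = 0
        for item, k in enumerate(assign):
            if k >= 0:
                load[k] += sizes[item]
                total += values[item]
        if all(l <= capacity for l in load):
            best = max(best, total)
    return best


def schedule_optimum(sizes: Sequence[int], values: Sequence[int],
                     n_procs: int, deadline: int) -> int:
    """Best total reward over all non-idling schedules with deterministic durations."""
    best = 0
    for order in permutations(range(len(sizes))):
        for assign in product(range(n_procs), repeat=len(sizes)):
            clock = [0] * n_procs  # completion time of the last job on each processor
            total = 0
            for j in order:
                clock[assign[j]] += sizes[j]
                if clock[assign[j]] <= deadline:  # finished in time: reward v_j
                    total += values[j]
            best = max(best, total)
    return best
```

On every small instance tried this way, the two optima coincide: the jobs that finish by the deadline on a processor form a prefix of its queue whose total size is at most K, which is exactly a feasible knapsack load.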



3 A Greedy Heuristic for Non-preemptive Scheduling of Decaying Jobs 

In light of Theorem 1, finding an optimal policy for the scheduling problem at hand is computationally
intractable. Therefore, it is highly desirable to find simple, but effective heuristics for practical deployment.
In this section, we examine one such policy.

A natural heuristic policy one may consider for the stochastic depletion problem is given by the greedy
policy which in state s with $F = \sum_n \mathbf{1}\{z(s)_n = 0\}$ free processors chooses the F available jobs with maximum
expected utility rate earned over the following time-step, $\mathbb{E}[w_j(t + \sigma_j)] / \mathbb{E}[\sigma_j]$. That is,

$$\pi^{g}(s) \in \arg\max_{A \in \mathcal{A}(s)} \sum_{(j,n) \in A} \frac{\mathbb{E}[w_j(t(s) + \sigma_j)]}{\mathbb{E}[\sigma_j]}. \tag{15}$$

Such a policy is adaptive but ignores the evolution of the reward functions, $w_j(t)$, and its impact on rewards
accrued in future states. We denote by $J_t^{g}(s)$ the reward garnered by the greedy policy starting in state s.
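
A minimal sketch of the greedy selection step follows, assuming service-time distributions with finite support given as dictionaries and rewards given as functions of the completion slot. The names `greedy_index` and `greedy_choice` are hypothetical, and ties are broken by the order in which free jobs are listed, which the text does not specify.

```python
from typing import Callable, Dict, List, Sequence


def greedy_index(pmf: Dict[int, float], reward: Callable[[int], float], t: int) -> float:
    """Expected reward rate of starting the job at slot t: E[w(t + sigma)] / E[sigma]."""
    expected_reward = sum(p * reward(t + d) for d, p in pmf.items())
    expected_duration = sum(p * d for d, p in pmf.items())
    return expected_reward / expected_duration


def greedy_choice(
    free_jobs: Sequence[int],
    n_free_processors: int,
    pmfs: Sequence[Dict[int, float]],
    rewards: Sequence[Callable[[int], float]],
    t: int,
) -> List[int]:
    """Return the free jobs with the largest indices, one per free processor."""
    ranked = sorted(
        free_jobs,
        key=lambda j: greedy_index(pmfs[j], rewards[j], t),
        reverse=True,  # highest expected reward rate first
    )
    return ranked[:n_free_processors]
```

Since all processors have the same service rate, which free processor receives which selected job does not affect the reward, so the sketch returns only the chosen jobs.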

3.1 Sub-optimality of Greedy Policy 

We start with an instructive example which demonstrates the nature (and degree) of sub-optimality of the 
greedy policy. 

Example 1 (Greedy Sub-Optimality) Consider the case with 2 jobs and 1 machine. Time is initialized to
0 so that t = 0, J = 2 and N = 1. Assume that each job is waiting to begin processing so that $x_1 = x_2 = 1$
and $y_1 = y_2 = \emptyset$. The service times are Geometric and the expected service times for jobs 1 and 2 are M
and 1, respectively, i.e. $p_1 = 1/M$ and $p_2 = 1$. The reward functions are:

$$
w_1(t) = \begin{cases} M^2, & t = 1 \\ 0, & t > 1 \end{cases}
\qquad
w_2(t) = 1 + \epsilon, \quad \forall t
$$

for $\epsilon > 0$. Hence, the completion of job 1 generates a reward of $M^2$ if it is completed in the first time slot;
otherwise, no revenue is received. On the other hand, job 2 generates a reward of $1 + \epsilon$, regardless of
which time slot it is completed in. Since job 1 scheduled at t = 0 completes in the first slot with probability 1/M,
$\mathbb{E}[w_1(\sigma_1)] = M^2 \cdot (1/M) = M$ and $\mathbb{E}[\sigma_1] = M$. Therefore, the reward rates, as a function of the slot t in which the job is scheduled, are:

$$
\frac{\mathbb{E}[w_1(t + \sigma_1)]}{\mathbb{E}[\sigma_1]} = \begin{cases} 1, & t = 0 \\ 0, & t \ge 1 \end{cases}
\qquad
\frac{\mathbb{E}[w_2(t + \sigma_2)]}{\mathbb{E}[\sigma_2]} = 1 + \epsilon, \quad \forall t
$$

In time slot t = 0, the greedy policy schedules job 2 because its reward rate ($1 + \epsilon$) is greater than that of job
1 (1). Job 2 completes processing in one time slot and generates a reward of $w_2(1) = 1 + \epsilon$. At t = 1, only
job 1 remains to be processed. However, the service time for job 1 is at least one time slot, so when job 1
is completed at $t = 1 + \sigma_1 > 1$, no reward is generated. Hence the greedy policy generates a total expected
reward of $1 + \epsilon$.

On the other hand, the optimal policy realizes the reward of job 1 is degrading and schedules it first.
With probability 1/M, job 1 will complete by time slot t = 1 and generate reward $M^2$. However, with
probability $1 - 1/M$ it will take more than one time slot and generate no reward since $w_1(t) = 0$ for
t > 1. Upon the completion of job 1, job 2 is scheduled and it completes processing in 1 time slot. Since
$w_2(t) = 1 + \epsilon$ for all t, this results in an additional reward of $1 + \epsilon$. Hence, the total expected reward generated
by the optimal policy is $M + 1 + \epsilon$. Comparing the performance of the optimal and greedy policies gives
$J^{*}(s) / J^{g}(s) = (M + 1 + \epsilon) / (1 + \epsilon)$.

Letting $\epsilon \to 0$, it is easy to see that the greedy policy results in an M + 1 approximation, where
$M = \mathbb{E}[\max_j \sigma_j] / \min_j \mathbb{E}[\sigma_j] = \Delta$. This suggests that the approximation of the greedy policy is dependent on the relationship
between job service times. The following subsection specifies this relation.
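
The arithmetic of Example 1 can be reproduced with exact rational numbers. The sketch below hard-codes the two-job instance from the example; the function name `example1` is hypothetical, and the only fact it relies on is that job 1 earns $M^2$ exactly when it is started at t = 0 and finishes in one slot, which happens with probability 1/M.

```python
from fractions import Fraction
from typing import Tuple


def example1(M: int, eps: Fraction) -> Tuple[Fraction, Fraction, Fraction, Fraction]:
    """Return (index of job 1, index of job 2, greedy reward, optimal reward)."""
    p1 = Fraction(1, M)             # job 1 finishes in any given slot w.p. 1/M
    w1_first_slot = Fraction(M * M)  # w_1(1) = M^2; w_1(t) = 0 for t > 1
    w2 = 1 + eps                     # w_2(t) = 1 + eps for every t

    # Greedy indices at t = 0: expected reward divided by expected service time.
    index1 = p1 * w1_first_slot / M
    index2 = w2 / 1

    # Expected total reward of each of the two possible orders.
    job2_first = w2 + 0                   # job 1 then finishes after slot 1: no reward
    job1_first = p1 * w1_first_slot + w2  # job 2 earns 1 + eps whenever it runs

    greedy = job2_first if index2 > index1 else job1_first
    optimal = max(job1_first, job2_first)
    return index1, index2, greedy, optimal
```

For instance, with M = 10 and eps = 1/100 the ratio of optimal to greedy reward is 1101/101, matching $(M + 1 + \epsilon)/(1 + \epsilon)$.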

3.2 The Greedy Heuristic is an online $(2 + \Delta)$-Approximation

In this section we will show that the greedy heuristic is within a factor of $2 + \Delta$ of optimal, where

$$\Delta = \frac{\mathbb{E}[\max_j \sigma_j]}{\min_j \mathbb{E}[\sigma_j]}. \tag{16}$$

Before we can prove this result, we need to first show a few properties of the system and the optimal value
function, $J_t^{*}$.
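
As a rough aid for reading the bound, the ratio can be computed exactly for independent service times with finite support. The sketch assumes that the ratio in (16) is read as the expected largest service time divided by the smallest expected service time; the function name `delta` and the dictionary representation of the distributions are assumptions made for illustration.

```python
from itertools import product
from math import prod
from typing import Dict, Sequence


def delta(pmfs: Sequence[Dict[int, float]]) -> float:
    """E[max_j sigma_j] / min_j E[sigma_j] for independent finite-support pmfs."""
    expected_max = 0.0
    # Enumerate every joint outcome; independence lets probabilities multiply.
    for outcome in product(*(pmf.items() for pmf in pmfs)):
        durations = [d for d, _ in outcome]
        probability = prod(p for _, p in outcome)
        expected_max += probability * max(durations)
    smallest_mean = min(sum(d * p for d, p in pmf.items()) for pmf in pmfs)
    return expected_max / smallest_mean
```

For two jobs where the first takes 1 or 3 slots with equal probability and the second always takes 2 slots, the expected maximum is 2.5 and both means equal 2, giving a ratio of 1.25. The enumeration is exponential in the number of jobs, so this is only practical for small examples.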
We begin with a monotonicity property based on the number of jobs remaining to be processed. Intuitively,
if one were given an additional set of jobs to process, the reward that can be garnered by the
completion of the original set of jobs in conjunction with the additional jobs will be more than if those extra 
jobs were not available. Consider two states: s and s' which are nearly identical except state s has more jobs 
to process than state s'. In other words, all jobs that have been completed in the s-system have also been 
completed in the s'-system. Similarly, any job that has started processing in the s-system has also started 
processing in the s'-system at the exact same time on the same machine. Any additional jobs in state s are 
jobs that have not started processing, but have already been completed in state s'. That is, the additional 
jobs are only available for processing in the s-system. Then the reward-to-go generated starting in state s is 
larger than that starting in state s'. The following lemma formalizes this intuition. 

Lemma 1 (Monotonicity in Jobs) Consider states s and s' such that state s has more jobs than state s' and
any job that has started in state s' started processing in the exact same time slot in state s, so that for each
job j: $x(s)_j \ge x(s')_j$ and

$$y(s)_j = \begin{cases} y(s')_j, & \text{if } x(s)_j = x(s')_j; \\ \emptyset, & \text{if } x(s)_j > x(s')_j. \end{cases}$$

Also, in both states, each processor n is either not busy or busy processing the same job: $z(s)_n = z(s')_n$.
For all states s and s' which satisfy these conditions, the following holds:

$$J_t^{*}(s) \ge J_t^{*}(s').$$

Proof: Consider a coupling of the systems starting at s and s' such that they see the same realizations of
service times $\sigma_j$ (and residual service times for jobs that have already started processing). This is possible
for all jobs $j \in \mathcal{J}_{s'} = \{j \in \mathcal{J} \mid x(s')_j = 1\} \subseteq \mathcal{J}_s = \{j \in \mathcal{J} \mid x(s)_j = 1\}$ because they have the same
distributions. $\mathcal{J}_s$ and $\mathcal{J}_{s'}$ denote the jobs to be completed under the systems starting in states s and s',
respectively.

Let $\pi^{*}(s')$ denote the optimal scheduling policy starting from state s'. Consider a policy $\tilde{\pi}$ that starts in
state s and mimics $\pi^{*}(s')$ until all jobs $j \in \mathcal{J}_{s'}$ are completed and completes the rest of the jobs $j \in \mathcal{J}_s \setminus \mathcal{J}_{s'}$
in sequential order. That is, under $\tilde{\pi}$ the scheduler initially pretends that jobs $j \in \mathcal{J}_s \setminus \mathcal{J}_{s'}$ do not exist and
uses the optimal policy under this assumption; once these jobs are completed, it processes the remaining
jobs in an arbitrary order. Said another way, the $\tilde{\pi}$ policy blocks processing of the additional jobs in state
s ($j \in \mathcal{J}_s \setminus \mathcal{J}_{s'}$) and optimally processes the remaining jobs. Once these jobs ($j \in \mathcal{J}_{s'}$) are completed,
the $\tilde{\pi}$ policy 'unlocks' the remaining, additional jobs and processes them in an arbitrary manner. Fig. 2
demonstrates the relationship between $\pi^{*}(s')$ and $\tilde{\pi}$ for a single server over a particular sample path for
service times.
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Figure 2: Monotonicity in Jobs: A single server scenario. The s-system is given 2 additional jobs, j = 5 
and j = 6. The s'-system uses policy vr*(s') to optimally process all jobs j = 1, 2, 3, 4. The s-system uses 
policy tt which mimics tt* (s ; ) until all jobs j = 1, 2, 3, 4 are completed at time T s / and then processes the 
remaining additional jobs. 



Let $T_j$ be the completion time of job $j$ for the $s$-system when using policy $\tilde{\pi}$. Similarly, let $T_j^*$ be the completion time of job $j$ for the $s'$-system under the optimal policy, $\pi^*(s')$. By our coupling, for all $j \in J_{s'}$, $T_j = T_j^*$, i.e. the completion time of job $j$ is identical under the $s$-system which uses policy $\tilde{\pi}$ and under the $s'$-system which uses policy $\pi^*(s')$. (Notice in Fig. 2 that jobs 1, 2, 3, 4 complete at the same time in the $s'$ and $s$-systems.) We use the notation $J_t^*(s \mid \sigma)$ for the optimal reward-to-go given the filtration of the job service times, i.e. given a sample path of realizations of the $\sigma_j$. We employ similar notation for $J_t^{\tilde{\pi}}$. We have,

$$\begin{aligned}
J_t^*(s \mid \sigma) &\ge J_t^{\tilde{\pi}}(s \mid \sigma) \\
&= \sum_{j \in J_s} w_j(T_j) \\
&= \sum_{j \in J_{s'}} w_j(T_j) + \sum_{j \in J_s \setminus J_{s'}} w_j(T_j) \\
&= \sum_{j \in J_{s'}} w_j(T_j^*) + \sum_{j \in J_s \setminus J_{s'}} w_j(T_j) \\
&\ge J_t^*(s' \mid \sigma)
\end{aligned}$$

The first inequality comes from the optimality of $J_t^*(\cdot)$. The first equality comes from the definition of the reward function, $T_j$, and the $\tilde{\pi}$ policy. The third equality comes from the coupling of the two systems so that $T_j = T_j^*$ for all $j \in J_{s'}$. The last inequality comes from the non-negativity of the rewards in Assumption 2. Taking expectations over $\sigma_j$ yields the desired result. ■
Next, we consider a property of the optimal policy. In every time slot, there will be a set (possibly empty) 
of free machines ($z_n = 0$). In each time slot, the optimal policy will assign a job to all free machines,
assuming there are enough available jobs. That is, while there are still jobs waiting to be processed, no 
machine will idle under the optimal policy. 

Lemma 2 (Non-idling) Suppose in state $s$, there are $F = |\{n \in \mathcal{M} \mid z(s)_n = 0\}|$ free machines, and the number of jobs remaining to be processed is $K = |\{j \in J \mid x(s)_j = 1, y(s)_j = 0\}|$. Then, under the optimal policy $\pi^*(s)$, the number of job-processor pairs executed in the next time slot will be:

$$|A| = \min\{K, F\},$$

i.e. the optimal policy is non-idling.

Proof: The proof is by contradiction. What needs to be shown is that nothing can be gained by idling ($|A| < \min\{K, F\}$). Suppose that under the optimal policy, a processor remains free (idles), even though there is an available job to work on. Consider another policy $\tilde{\pi}$ which is identical to the $\pi^*$ policy except it begins processing all jobs on the idling machine one time slot earlier. Due to Assumption 2, by processing the jobs earlier, this will result in an increase in reward. This contradicts the optimality of the idling policy; hence, no optimal policy will idle. ■
Now consider two systems which are identical, except one machine is tied up longer in the second 
system. The following lemma says that the maximum amount of additional revenue accrued by the first 
system for being able to start processing earlier is given by the reward rate of the greedy job; that is, the job 
of maximum reward rate amongst those in processing or waiting. 

Lemma 3 (Greedy Revenue) Consider a state $S_t = s$ in time slot $t$ and let $g$ denote the index of a greedy job, i.e. $g = \arg\max_{j \in J_s} \frac{E[w_j(t + \sigma_j)]}{E[\sigma_j]}$ for all jobs which are mid-processing or have not started ($J_s = \{k \in J \mid x_k(s) = 1\}$).

Denote by $s_g$ and $s_i$ two states which are related to state $s$ in the following manner. The two states are identical to state $s$, except on free machine $n_g$ ($z(s)_{n_g} = 0$). In state $s_g$, machine $n_g$ is occupied by a replica of job $g$, meaning it has the same service time distribution as job $g$; however, its completion does not generate any rewards nor does it affect the completion of the original job $g$. Similarly, in state $s_i$, machine $n_g$ is occupied by a replica of job $i$. Said in notation: $x(s_i)_j = x(s_g)_j = x(s)_j$ and $y(s_i)_j = y(s_g)_j = y(s)_j$ for all $j$, $z(s_i)_n = z(s_g)_n = z(s)_n$ for all $n \ne n_g$, and $z(s_g)_{n_g} = g$ while $z(s_i)_{n_g} = i$ for some arbitrary job index $i$ and machine $n_g$. Then,

$$E[J_t^*(s_i)] \le \left(1 - \frac{E[\sigma_i]}{E[\sigma_g]} + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\right) E[w_g(t + \sigma_g)] + E[J_t^*(s_g)].$$

Proof: We begin by coupling the systems such that they see the same realizations for service times. Note that the replicated jobs which currently occupy machine $n_g$ need not have the same service time as their original jobs, $i$ or $g$, despite having the same distribution.

Consider a policy $\tilde{\pi}$ for the $s_g$-system which mimics the $\pi^*(s_i)$ policy. While processor $n_g$ is occupied by replica job $g$, which blocks processing of other jobs, the $s_g$-system will simulate the service time of jobs on processor $n_g$. There are two possible cases, $\sigma_i \ge \sigma_g$ and $\sigma_i < \sigma_g$.



Case 1, $\sigma_i \ge \sigma_g$: the $\tilde{\pi}$ policy idles on machine $n_g$ until $t + \sigma_i$ (the time at which machine $n_g$ is free in the $s_i$-system). At this point, the $s_g$-system is 'synced' with the $s_i$-system and it proceeds with executing the optimal policy for the $s_i$-system, $\pi^*(s_i)$. See Fig. 3 for a single processor example of such a scenario.

Figure 3: Case 1, $\sigma_i \ge \sigma_g$. The optimal policy is used for the $s_i$-system, which processes jobs in order 2, 3, 4, $g$. The $s_g$-system uses policy $\tilde{\pi}$ which mimics $\pi^*(s_i)$. Because job $i$ completes after job $g$, the $\tilde{\pi}$ policy idles. Note that job $g$ is processed twice in the $s_g$-system because the first job is just a replica. Job $i$ is only processed once in the $s_i$-system because even though $i$ is a replica, the original had already completed processing.

If $T_j^*(s_i)$ is the completion time of job $j$ in the $s_i$-system under optimal policy $\pi^*(s_i)$, and $T_j$ is the completion time of job $j$ in the $s_g$-system under the $\tilde{\pi}$ policy, then $T_j = T_j^*(s_i)$. Employing similar notation as before, we consider the reward-to-go on a single realized sample path of service times, given by $\sigma$ and the event $\sigma_i \ge \sigma_g$:

$$\begin{aligned}
J_t^*(s_i \mid \sigma, \sigma_i \ge \sigma_g) &= \sum_j w_j(T_j) \\
&= J_t^{\tilde{\pi}}(s_g \mid \sigma, \sigma_i \ge \sigma_g) \\
&\le J_t^*(s_g \mid \sigma, \sigma_i \ge \sigma_g) \\
&\le J_t^*(s_g \mid \sigma, \sigma_i \ge \sigma_g) + \frac{E[w_g(t + \sigma_g) \mid \sigma, \sigma_i < \sigma_g]}{E[\sigma_g \mid \sigma, \sigma_i < \sigma_g]} E[\sigma_g - \sigma_i + \sigma_{\max} \mid \sigma, \sigma_i < \sigma_g] \qquad (17)
\end{aligned}$$

Case 2, $\sigma_i < \sigma_g$: In this case, $\tilde{\pi}$ cannot exactly mimic the $\pi^*(s_i)$ policy because machine $n_g$ will continue to be busy after $i$ completes in the $s_i$-system. The $\tilde{\pi}$ policy will simulate the processing of jobs on $n_g$ while the machine is still busy. Let $J_{sim}$ denote the set of jobs whose processing is simulated. Despite the fact that these simulated jobs will not actually be completed, the $\tilde{\pi}$ policy assumes they are. The $\tilde{\pi}$ policy continues to follow the $\pi^*(s_i)$ policy until all jobs are 'completed' in the sense that they are actually completed or their completion was simulated because processor $n_g$ was busy under the $s_g$-system when it was free under the $s_i$-system. The $\tilde{\pi}$ policy then finishes processing the simulated jobs ($j \in J_{sim}$) in an arbitrary manner so that they are actually completed. That is, the actual completion of the simulated jobs is transferred to after the rest of the jobs have completed processing. Fig. 4 shows an example sample path of this scenario.

If $T_j^*(s_i)$ is the completion time of job $j$ in the $s_i$-system under optimal policy $\pi^*(s_i)$, and $T_j$ is the completion time of job $j$ in the $s_g$-system under the $\tilde{\pi}$ policy, then $T_j = T_j^*(s_i)$ for all $j \notin J_{sim}$. Then (again employing the notation given the filtration of $\sigma_j$ and the case $\sigma_i < \sigma_g$):

$$\begin{aligned}
J_t^*(s_i \mid \sigma, \sigma_i < \sigma_g) &= \sum_{j \in J_s} w_j(T_j^*(s_i)) \\
&= \sum_{j \in J_{sim}} w_j(T_j^*(s_i)) + \sum_{j \notin J_{sim}} w_j(T_j^*(s_i)) \\
&\le \sum_{j \in J_{sim}} w_j(T_j^*(s_i)) + \sum_{j \in J_{sim}} w_j(T_j) + \sum_{j \notin J_{sim}} w_j(T_j^*(s_i)) \\
&= \sum_{j \in J_{sim}} w_j(T_j^*(s_i)) + J_t^{\tilde{\pi}}(s_g \mid \sigma, \sigma_i < \sigma_g) \\
&\le \sum_{j \in J_{sim}} w_j(T_j^*(s_i)) + J_t^*(s_g \mid \sigma, \sigma_i < \sigma_g) \qquad (18)
\end{aligned}$$

The first inequality comes from the non-negativity of rewards. The third equality comes from our coupling and the definition of the $\tilde{\pi}$ policy. The last inequality comes from the optimality of $J_t^*$.

Figure 4: Case 2, $\sigma_i < \sigma_g$. The optimal policy is used for the $s_i$-system, which processes jobs in order 2, 3, 4. The $s_g$-system uses policy $\tilde{\pi}$ which mimics $\pi^*(s_i)$. Because job $i$ completes before job $g$, the $\tilde{\pi}$ policy is blocked until $t + \sigma_g$. At time $t + \sigma_i$, the $\tilde{\pi}$ policy simulates the processing of jobs 2 and 3 on machine $n_g$. The machine will idle once replica job $g$ completes and before job 3 finishes its simulated processing. At time $\tau$, the $\tilde{\pi}$ policy is able to follow the $\pi^*(s_i)$ policy. Then the simulated jobs 2 and 3 are completed in an arbitrary order after the $\pi^*(s_i)$ policy completes at time $T_{s_i}$. Note that jobs $g$ and $i$ are processed once in each system because the original jobs have already completed processing (the replicas are processed by time $t$).

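Taking expectations over the $\sigma_j$, or equivalently the $T_j^*(s_i)$, and using a little algebra for (18):

$$\begin{aligned}
J_t^*(s_i \mid \sigma_i < \sigma_g) &\le \sum_{j \in J_{sim}} E[w_j(T_j^*(s_i)) \mid \sigma_i < \sigma_g] + J_t^*(s_g \mid \sigma_i < \sigma_g) \qquad (19) \\
&\le \sum_{j \in J_{sim}} \frac{E[w_j(t + \sigma_j) \mid \sigma_i < \sigma_g]}{E[\sigma_j \mid \sigma_i < \sigma_g]} E[\sigma_j \mid \sigma_i < \sigma_g] + J_t^*(s_g \mid \sigma_i < \sigma_g) \\
&\le \max_k \frac{E[w_k(t + \sigma_k) \mid \sigma_i < \sigma_g]}{E[\sigma_k \mid \sigma_i < \sigma_g]} \sum_{j \in J_{sim}} E[\sigma_j \mid \sigma_i < \sigma_g] + J_t^*(s_g \mid \sigma_i < \sigma_g) \\
&\le \frac{E[w_g(t + \sigma_g) \mid \sigma_i < \sigma_g]}{E[\sigma_g \mid \sigma_i < \sigma_g]} E[\sigma_g + \sigma_l - \sigma_i \mid \sigma_i < \sigma_g] + J_t^*(s_g \mid \sigma_i < \sigma_g) \\
&\le \frac{E[w_g(t + \sigma_g) \mid \sigma_i < \sigma_g]}{E[\sigma_g \mid \sigma_i < \sigma_g]} E[\sigma_g - \sigma_i + \sigma_{\max} \mid \sigma_i < \sigma_g] + J_t^*(s_g \mid \sigma_i < \sigma_g)
\end{aligned}$$

The second inequality comes from the fact that for all $j$, $T_j^*(s_i) \ge t + \sigma_j$ since the earliest time a job can begin processing is $t$ and all $w_j(t)$ are non-increasing in $t$ (Assumption 2). The fourth inequality comes from the definition of job $g$. Now, consider the total service time of simulated jobs. Simulated jobs begin at $t + \sigma_i$ and finish at $\tau \ge t + \sigma_g$. In particular, there exists some $l$ such that the first time machine $n_g$ is free under policy $\tilde{\pi}$ is $\tau \le t + \sigma_g + \sigma_l$, i.e. $l$ is the last simulated job (job 2 in Fig. 4). Hence the total service time of simulated jobs is bounded above by $(t + \sigma_g + \sigma_l) - (t + \sigma_i)$. This yields inequality four. The last inequality uses $\sigma_l \le \sigma_{\max}$.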

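Combining (17) and (19), and taking expectations over the events $\sigma_i \ge \sigma_g$ and $\sigma_i < \sigma_g$ yields:

$$\begin{aligned}
J_t^*(s_i) &\le \frac{E[w_g(t + \sigma_g)]}{E[\sigma_g]} \left(E[\sigma_g] - E[\sigma_i] + E[\sigma_{\max}]\right) + J_t^*(s_g) \\
&= \left(1 - \frac{E[\sigma_i]}{E[\sigma_g]} + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\right) E[w_g(t + \sigma_g)] + J_t^*(s_g)
\end{aligned}$$

which concludes the proof. ■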
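The greedy index $g$ used in Lemma 3 can be sketched numerically. The following is only an illustrative sketch, assuming Geometric($p_j$) service times on $\{1, 2, \ldots\}$ (so $E[\sigma_j] = 1/p_j$) and considering only jobs that have not started; the function names `expected_reward` and `greedy_job` and the truncation parameter `horizon` are hypothetical and do not come from the paper.

```python
from typing import Callable, Sequence


def expected_reward(w: Callable[[int], float], p: float, t: int,
                    horizon: int = 10_000) -> float:
    """Approximate E[w(t + sigma)] for sigma ~ Geometric(p) on {1, 2, ...}.

    The series is truncated once the remaining tail mass is negligible
    (or after `horizon` terms, an assumption of this sketch).
    """
    total = 0.0
    tail = 1.0  # P(sigma >= k), starting at k = 1
    for k in range(1, horizon + 1):
        total += tail * p * w(t + k)  # P(sigma = k) = P(sigma >= k) * p
        tail *= 1.0 - p
        if tail < 1e-15:
            break
    return total


def greedy_job(rewards: Sequence[Callable[[int], float]],
               probs: Sequence[float], t: int) -> int:
    """Index maximizing the reward rate E[w_j(t + sigma_j)] / E[sigma_j]."""
    # For Geometric(p) service times, dividing by E[sigma] = 1/p is
    # the same as multiplying by p.
    rates = [expected_reward(w, p, t) * p for w, p in zip(rewards, probs)]
    return max(range(len(rates)), key=rates.__getitem__)
```

For instance, a job with constant reward 3 and $p = 0.5$ has reward rate about 1.5, so it is preferred at $t = 0$ over a unit-length job whose reward of 1 expires after slot 1.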
Suppose we were able to process a job without using a machine. The total reward gained by the use of this 'virtual machine' is greater than the reward gained without the use of it. Define $S' : \mathcal{S} \times J \to \mathcal{S}$ as the operation/function which reduces state $s$ to state $s'_i = S'(s, i)$ by removing job $i$ which has not yet begun processing in state $s$. That is, starting in state $s$, select a job $i$ that has not been completed. Complete job $i$ and generate its associated reward without tying up a processor. Said in notation, $\forall n$: $z(s'_i)_n = z(s)_n$; and $\forall j \ne i$: $x(s'_i)_j = x(s)_j$ and $y(s'_i)_j = y(s)_j$, but $x(s'_i)_i = (x(s)_i - 1)^+$ and $y(s'_i)_i \ne 0$.

Lemma 4 (Virtual Machine Rewards) For all states $s$ and any job $i$, let state $S'(s, i)$ denote the resulting state if job $i$ were processed without occupying a processor. Also, reward $w_i$ is generated upon completion. Then:

$$J_t^*(s) \le E[w_i(t + \sigma_i)] + J_t^*(S'(s, i)).$$

Proof: Consider a coupling of the systems starting in state $s$ and $s'_i = S'(s, i)$ such that they see the same realizations of the service times for all jobs. Let $\pi^*(s)$ denote the optimal scheduling policy starting from state $s$.

In the $s'_i$-system, we call job $i$ a 'fictitious' job. It is fictitious because it does not actually exist (it has already completed) under the $s'_i$-system. Consider a policy $\tilde{\pi}$ which assumes that job $i$ is a 'real' (available/not processed) job and executes the optimal policy under this assumption, i.e. at time slot $t$, it assumes it is in state $s$ (rather than $s'_i$) and executes the optimal policy $\pi^*(s)$. When $\tilde{\pi}$ schedules job $i$, there is no job to actually process, so the processor will idle while it simulates the processing time for job $i$, which is identically distributed to $\sigma_i$ under the $s$-system. See Fig. 5 for a single machine example of the $\tilde{\pi}$ and $\pi^*(s)$ policies given a sample path for service time realizations.

Figure 5: Virtual machine: A single server scenario. Under the $s'_i$-system, job 2 is processed on a virtual machine. The $s$-system uses policy $\pi^*(s)$ to optimally process all jobs $j = 1, 2, 3, 4$. The $s'_i$-system uses policy $\tilde{\pi}$ which mimics $\pi^*(s)$. Because job $j = 2$ has already been processed on the 'virtual machine', the $\tilde{\pi}$ policy idles.



Let $T_j$ be the completion time of job $j$ under the $\tilde{\pi}$ policy. Note that $T_i$ is the completion time of the fictitious job, $i$. Let $t_i$ denote the random time at which job $i$ begins 'processing' under this policy. Under our coupling, $T_j$ is precisely $t_j$ plus the processing time of job $j$ under $\pi^*(s)$ for the $s$-system, where $t_j$ is the time at which job $j$ begins processing. Hence,

$$\begin{aligned}
J_t^*(s \mid \sigma_j) &= \sum_j w_j(T_j) \\
&= \sum_{j \ne i} w_j(T_j) + w_i(T_i) \\
&= J_t^{\tilde{\pi}}(s'_i \mid \sigma_j) + w_i(t_i + \sigma_i) \\
&\le J_t^*(s'_i \mid \sigma_j) + w_i(t + \sigma_i)
\end{aligned}$$

The inequality results from the non-increasing property of the reward functions in Assumption 2 and from the optimality of $J_t^*(\cdot)$. Taking expectations over $\sigma_j$ yields the desired result. ■
We are now in position to prove the main result of this paper. Let $\Delta = \frac{E[\sigma_{\max}]}{\min_j E[\sigma_j]}$ as in (16).

Theorem 2 For all states $s \in \mathcal{S}$, the following performance guarantee for the greedy policy holds:

$$J_t^*(s) \le (2 + \Delta) J_t^g(s).$$

Proof: The proof proceeds by induction on the number of jobs remaining to be processed, $\sum_{j \in J} 1\{y_j = 0\}$. The claim is trivially true if there is only one job remaining to be processed: the greedy and optimal policies will coincide. Now consider a state $s$ such that $\sum_j 1\{y(s)_j = 0\} = K$, and assume that the claim is true for all states $s'$ with $K > \sum_j 1\{y(s')_j = 0\}$.

Now if $\pi_t^*(s) = \pi_t^g(s)$ then the next state encountered and rewards generated in both systems are identically distributed so that the induction hypothesis immediately yields the result for state $s$.

Consider the case where $\pi_t^*(s) \ne \pi_t^g(s)$. Denote by $\mathcal{J}^*$ and $\mathcal{J}_g$ the sets of jobs processed by the optimal and greedy policies in state $s$. Note that these sets depend on the current time slot $t$ and the state $s$; however, we suppress them for notational compactness. Recall that, by Lemma 2, $|\mathcal{J}^*| = |\mathcal{J}_g|$. Let $A^*$ and $A_g$ denote the optimal and greedy scheduling actions, respectively, given state $s$ in time slot $t$.

Taking definitions from before, we define $S(s, A)$ as the random next state encountered given that we start in state $s$ and action $A$ is taken. Also, define $S'(s, i)$ as state $s$ with the completion of job $i$, i.e. job $i$ is completed ($x_i = 0$) without using a processor.

Define the operator $\bar{S} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ which transforms state $s$ by tying up machines with replicas of the jobs defined by $A$. That is, $\bar{s} = \bar{S}(s, A)$ is the state where jobs begin processing on the machines given by $A$, but no reward is generated for their completion and they remain to be processed at a later time (reward is generated upon this second completion). This second completion may occur prior to or following the completion of the replicated job. $A$ defines which jobs are replicated and which machines they are processed on, and hence occupy; replicated jobs do not generate any reward. Put another way, $\bar{s}$ is a new state where machines are occupied for an amount of time defined by the service times of jobs in $A$. Said in notation, $x(\bar{S}(s, A))_j = x(s)_j$ and $y(\bar{S}(s, A))_j = y(s)_j$ for all $j$, while $z(\bar{S}(s, A))_n = j$ if $(j, n) \in A$ and $z(\bar{S}(s, A))_n = z(s)_n$ otherwise.

We have:

$$\begin{aligned}
J_t^*(s) &= \sum_{j \in \mathcal{J}^*} E[w_j(t + \sigma_j)] + E[J_t^*(S(s, A^*))] \\
&\le \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}_g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + E[J_t^*(S(s, A^*))] \\
&\le \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}_g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + E[J_t^*(\bar{S}(s, A^*))] \qquad (20)
\end{aligned}$$

The first inequality comes from the definition of the greedy policy; the reward rate for greedy jobs is higher than for the optimal jobs. The second inequality comes from Lemma 1 by putting back the jobs in $A^*$. That is, the machines are occupied by replicas of jobs defined in $A^*$, but the original jobs are placed back to be completed at a later date. These additional jobs generate more reward as shown in Lemma 1.

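Continuing (20), we switch $A^*$ with $A_g$. That is, instead of tying up the machines with replicas of the optimal jobs, they are tied up with replicas of the greedy jobs. Because $|\mathcal{J}^*| = |\mathcal{J}_g|$ and the processing times on each machine are identical, we can consider each machine individually and use Lemma 3 so that,

$$\begin{aligned}
&\sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}_g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + E[J_t^*(\bar{S}(s, A^*))] \\
&\quad \le \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}_g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}_g)} E[w_g(t + \sigma_g)] \left(1 - \frac{E[\sigma_i]}{E[\sigma_g]} + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\right) + E[J_t^*(\bar{S}(s, A_g))] \\
&\quad = \sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] \left(1 + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\right) + E[J_t^*(\bar{S}(s, A_g))] \qquad (21)
\end{aligned}$$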

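Continuing (21), we now complete the greedy jobs without occupying any machines:

$$\begin{aligned}
&\sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] \left(1 + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\right) + E[J_t^*(\bar{S}(s, A_g))] \\
&\quad \le \sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] \left(2 + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\right) + E[J_t^*(S(s, A_g))] \\
&\quad \le \sum_{g \in \mathcal{J}_g} \left(2 + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\right) E[w_g(t + \sigma_g)] + \left(2 + \frac{E[\sigma_{\max}]}{\min_k E[\sigma_k]}\right) E[J_t^g(S(s, A_g))] \\
&\quad \le (2 + \Delta) J_t^g(s)
\end{aligned}$$

The first inequality comes from the use of 'virtual machines' for the greedy jobs under Lemma 4. The second inequality comes from the induction hypothesis. This concludes the proof. ■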
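To get a feel for the size of the guarantee $2 + \Delta$, the quantity $\Delta = E[\sigma_{\max}] / \min_j E[\sigma_j]$ can be evaluated numerically. The sketch below assumes independent Geometric($p_j$) service times on $\{1, 2, \ldots\}$ and computes $E[\sigma_{\max}]$ by summing tail probabilities, $E[X] = \sum_{t \ge 0} P(X > t)$. This is a direct numerical evaluation, not the upper bound of Appendix A, and the function names and the tolerance `tol` are hypothetical.

```python
from math import prod
from typing import Sequence


def expected_max_geometric(probs: Sequence[float], tol: float = 1e-12) -> float:
    """E[max_j sigma_j] for independent sigma_j ~ Geometric(p_j) on {1, 2, ...}.

    Uses P(max > t) = 1 - prod_j (1 - (1 - p_j)^t) and stops once a tail
    term falls below `tol` (an assumption of this sketch).
    """
    total, t = 0.0, 0
    while True:
        tail = 1.0 - prod(1.0 - (1.0 - p) ** t for p in probs)
        if tail < tol:
            return total
        total += tail
        t += 1


def greedy_guarantee(probs: Sequence[float]) -> float:
    """The factor 2 + Delta of Theorem 2 for Geometric service times."""
    e_max = expected_max_geometric(probs)
    min_mean = min(1.0 / p for p in probs)  # E[sigma_j] = 1 / p_j
    return 2.0 + e_max / min_mean
```

As a sanity check, a single job with $p = 0.5$ gives $E[\sigma_{\max}] = 2$, and two i.i.d. jobs with $p = 0.5$ give $E[\sigma_{\max}] = 8/3$.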



4 Special Cases 

As shown in [9, 10], the greedy policy is optimal for linear or exponential decaying reward functions. Under a few other special cases, the bound in Theorem 2 can be improved.

4.1 Identical Processing Times

Suppose that all job service times are independent and identically distributed, i.e. in the case of Geometric service times, $p_j = p$ for all $j$. In general, there is no closed form equation for $E[\sigma_{\max}]$; however, in this case, the bound can be improved to a factor of 2. To do this, Lemma 3 needs to be modified.

Lemma 5 (Greedy Revenue, I.I.D. processing times) Consider a state $S_t = s$ in time slot $t$ and let $g$ denote the index of a greedy job, i.e. $g = \arg\max_{j \in J_s} \frac{E[w_j(t + \sigma_j)]}{E[\sigma_j]}$ for all jobs which are mid-processing or have not started ($J_s = \{k \in J \mid x_k(s) = 1\}$). Denote by $s_g$ and $s_i$ two states which are related to state $s$ as follows: $x(s_i)_j = x(s_g)_j = x(s)_j$ and $y(s_i)_j = y(s_g)_j = y(s)_j$ for all $j$, $z(s_i)_n = z(s_g)_n = z(s)_n$ for all $n \ne n_g$, and $z(s_g)_{n_g} = g$ while $z(s_i)_{n_g} = i$ for some arbitrary job index $i$ and machine $n_g$. That is, in state $s_g$, machine $n_g$ is occupied by a replica of job $g$; and in state $s_i$, machine $n_g$ is occupied by a replica of job $i$. Then,

$$E[J_t^*(s_i)] = E[J_t^*(s_g)].$$

Proof: Couple the systems such that they see the same realizations for service times of job $i$ and job $g$ which are currently occupying machine $n_g$. This coupling is possible since the jobs are i.i.d. Therefore, under this coupling there is no difference between state $s_i$ and $s_g$ since these 'jobs' are only occupying the machine but are not generating any rewards. Hence, $E[J_t^*(s_i)] = E[J_t^*(s_g)]$. ■

Now we are able to prove an improved bound on the performance of the greedy policy.

Theorem 3 Let the service time for job $j$ be distributed according to density function $f_j(\sigma)$. If all job service times are independent and identically distributed according to $f(\sigma)$, i.e. $f_j(\sigma) = f(\sigma)$ $\forall j$, then for all states $s \in \mathcal{S}$, the greedy policy is guaranteed to be within a factor of 2 of optimal:

$$J_t^*(s) \le 2 J_t^g(s).$$

Proof: Under this scenario, Lemma 3 can be replaced by Lemma 5 in the proof of Theorem 2. Hence, $E[J_t^*(\bar{S}(s, A^*))] = E[J_t^*(\bar{S}(s, A_g))]$: since the distribution of completion times is identical, the amount of time a processor is busy is independent of which job it is processing. Instead of replicating the entire proof here, we examine how (20), (21), and (22) change.

The only difference for (20) is that $E[\sigma_j] = E[\sigma_i]$ for all $i, j$, which allows for a slight simplification.

$$\begin{aligned}
J_t^*(s) &= \sum_{j \in \mathcal{J}^*} E[w_j(t + \sigma_j)] + E[J_t^*(S(s, A^*))] \\
&\le \sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] + E[J_t^*(S(s, A^*))] \\
&\le \sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] + E[J_t^*(\bar{S}(s, A^*))] \qquad (22)
\end{aligned}$$

Now, with the improvement to Lemma 3 in Lemma 5, (21) is reduced significantly:

$$\sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] + E[J_t^*(\bar{S}(s, A^*))] = \sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] + E[J_t^*(\bar{S}(s, A_g))] \qquad (23)$$

Finally, utilizing Lemma 4 and completing/generating rewards for the greedy jobs gives:

$$\begin{aligned}
\sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] + E[J_t^*(\bar{S}(s, A_g))] &\le 2 \sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] + E[J_t^*(S(s, A_g))] \qquad (24) \\
&\le 2 \sum_{g \in \mathcal{J}_g} E[w_g(t + \sigma_g)] + 2 E[J_t^g(S(s, A_g))] \\
&= 2 J_t^g(s)
\end{aligned}$$



In the case of i.i.d. service times, the greedy policy corresponds to scheduling the job with the highest 
expected rewards over their identical completion times. While this seems to be an intuitive policy, the 
following example shows what can go wrong. 

Example 2 Consider the case with 2 jobs and 1 machine ($J = 2$ and $N = 1$). We begin at $t = 0$. Assume that neither job has begun processing so that $x_1 = x_2 = 1$ and $y_1 = y_2 = 0$. The service times for job 1 and 2 are both deterministic and equal to 1. The reward functions are:

$$\text{For } j = 1: \; w_j(t) = \begin{cases} 1 - \epsilon, & t = 1 \\ 0, & t > 1 \end{cases} \qquad\qquad \text{For } j = 2: \; w_j(t) = 1, \; \forall t$$

for $\epsilon > 0$, so that the completion of job 1 only results in revenue if it is completed in the first time slot, but job 2 results in the same revenue, regardless of which time slot it is completed in. Therefore, the reward rates are:

$$\frac{E[w_1(t + \sigma_1)]}{E[\sigma_1]} = \begin{cases} 1 - \epsilon, & t + \sigma_1 = 1 \\ 0, & t + \sigma_1 > 1 \end{cases} \qquad\qquad \frac{E[w_2(t + \sigma_2)]}{E[\sigma_2]} = 1, \; \forall t$$

Clearly, the greedy policy is to schedule job 2 and then job 1 since the reward rate for job 2 is greater than that for job 1 ($1 > 1 - \epsilon$). However, when job 1 completes at $t = 2$, it generates no reward since $w_1(2) = 0$. This results in reward 1. On the other hand, the optimal policy recognizes that the reward of job 1 is degrading and schedules it first and schedules job 2 second. This results in reward $2 - \epsilon$. We thus see that $J_t^*(s) = (2 - \epsilon) J_t^g(s)$ here.
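The numbers in Example 2 can be checked by brute force. The sketch below assumes a single machine and deterministic integer service times, enumerates all job orders to find the optimal schedule, and compares it with the greedy order; the helper names are hypothetical and the value `eps = 0.01` is an arbitrary choice for illustration.

```python
from itertools import permutations


def total_reward(order, service, rewards, t0=0):
    """Total reward when jobs are run back to back in the given order."""
    t, total = t0, 0.0
    for j in order:
        t += service[j]          # completion time of job j
        total += rewards[j](t)   # reward is collected at completion
    return total


def optimal_order(service, rewards):
    """Best order found by exhaustive search (only viable for few jobs)."""
    jobs = range(len(service))
    return max(permutations(jobs), key=lambda o: total_reward(o, service, rewards))


def greedy_order(service, rewards, t0=0):
    """Repeatedly pick the job with the highest reward rate w_j(t + s_j) / s_j."""
    t, remaining, order = t0, set(range(len(service))), []
    while remaining:
        g = max(sorted(remaining),
                key=lambda j: rewards[j](t + service[j]) / service[j])
        order.append(g)
        remaining.remove(g)
        t += service[g]
    return order


eps = 0.01
service = [1, 1]  # index 0 is job 1, index 1 is job 2
rewards = [lambda t: 1 - eps if t <= 1 else 0.0, lambda t: 1.0]

greedy = total_reward(greedy_order(service, rewards), service, rewards)
optimal = total_reward(optimal_order(service, rewards), service, rewards)
print(greedy, optimal, optimal / greedy)
```

With these inputs the greedy order runs job 2 first and collects reward 1, while the optimal order runs job 1 first and collects $2 - \epsilon$.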

In light of the example just shown, the bound in Theorem 3 is tight.

4.2 Slowly Decaying Rewards

We have proven a worst-case bound for arbitrary decaying rewards. If the time-scale of decay is very long compared to the time-scale of job completion times, then the rewards would be nearly constant during the processing time of a job. In particular, as the decay goes to zero over the time-scale of job completion times, the performance of the greedy heuristic approaches the performance of the optimal policy.

We will now formally define the time-scale of decay. Consider a difference equation specification for the time-scale of decay. Let

$$\delta = \max_{t,k,m} E[w_k(t) - w_k(t + \sigma_m)] \ge 0.$$

We will show that as $\delta \to 0$, $J_t^g(s) \to J_t^*(s)$. To do this, we must start with a few preliminary results.

The first is that, as $\delta \to 0$, rewards become invariant to the completion time. Rewards are generated upon the completion of each job. However, as $\delta \to 0$, the reward generated at the completion time of a job is nearly the reward that would have been generated had the job had processing time 0.

Lemma 6 (Time-Invariant Rewards) For any jobs $i, j$ and time slot $t$, as the time-scale of decay, $\delta$, approaches 0, the reward generated for completing job $j$ is invariant to shifts in time by the service time of job $i$, $\sigma_i$. In particular,

$$\lim_{\delta \to 0} E[w_j(t + \sigma_i)] = w_j(t).$$

Proof: For any job indices $i, j$ and time slot $t$:

$$|E[w_j(t + \sigma_i)] - w_j(t)| \le \max_{\tau,k,m} |E[w_k(\tau + \sigma_m)] - w_k(\tau)| = \delta \qquad (25)$$

which implies that $|E[w_j(t + \sigma_i)] - w_j(t)| \to 0$ as $\delta \to 0$. ■
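The decay time-scale $\delta$ can be evaluated directly for a finite instance. The following sketch assumes each service-time distribution is given as a probability mass function (a dictionary mapping a service time to its probability) and that it suffices to scan time slots $0, \ldots, \texttt{horizon} - 1$; both the representation and the name `decay_timescale` are assumptions made for illustration.

```python
def decay_timescale(rewards, pmfs, horizon):
    """delta = max over t, k, m of E[w_k(t) - w_k(t + sigma_m)].

    rewards: list of functions w_k(t)
    pmfs:    list of dicts {service_time: probability}, one per job m
    horizon: time slots 0 .. horizon - 1 are scanned
    """
    best = 0.0
    for w in rewards:
        for pmf in pmfs:
            for t in range(horizon):
                # expected drop in reward when completion is delayed by sigma_m
                drop = sum(prob * (w(t) - w(t + s)) for s, prob in pmf.items())
                best = max(best, drop)
    return best
```

For a linear reward $w(t) = \max(0, 10 - t)$ and a deterministic service time of 2, the largest expected drop is 2; flattening the slope to 0.1 reduces $\delta$ to 0.2, which is the regime in which Lemma 6 says rewards are nearly time-invariant.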
Because rewards are nearly constant over the time-scale of job completions, starting a job $\sigma_i$ time slots later does not significantly reduce the aggregate reward accrued. The following lemma is similar to Lemma 3 for slowly decaying reward functions. Define $\bar{S}(s, A)$ as in Section 3.2 so that $\bar{S}(s, A)$ is the state where jobs are processed on the machines given by $A$, but they are not removed and no reward is generated for this initial processing. These replica jobs occupy the machines, making them unable to process other jobs in the meantime. However, they do not generate reward. In notation, $x(\bar{S}(s, A))_j = x(s)_j$ and $y(\bar{S}(s, A))_j = y(s)_j$ for all $j$, while $z(\bar{S}(s, A))_n = j$ for all $(j, n) \in A$ and $z(\bar{S}(s, A))_n = z(s)_n$ otherwise.

Lemma 7 (Delayed Machine) Let $\bar{s} = \bar{S}(s, A)$ denote the resulting state if machines in $A$ are occupied, but all the jobs have the same (unprocessed) state as in state $s$. Then, starting in any state $s$ and given action $A$, the difference in optimal reward-to-go generated in states $s$ and $\bar{s}$ goes to 0 as the time-scale of decay, $\delta$, goes to 0, i.e.

$$|J_t^*(s) - E[J_t^*(\bar{S}(s, A))]| \to 0 \text{ as } \delta \to 0.$$

Proof: To begin, note that $J_t^*(s) \ge E[J_t^*(\bar{S}(s, A))]$. To see this, we couple the job completion times. Let $\tilde{\pi}$ denote a policy starting from state $s$, but mimicking the optimal policy starting from state $\bar{S}(s, A) = \bar{s}$, $\pi^*(\bar{s})$. Therefore, under the $\tilde{\pi}$ policy, machine $n_k$ will idle for $\sigma_k$ time slots before proceeding if $(k, n_k) \in A$. The $s$-system simply delays processing any new jobs until the replica jobs in the $\bar{s}$-system are completed. In this case, the completion time for jobs will be identical under the $\tilde{\pi}$ and $\pi^*(\bar{s})$ policies. Hence $J_t^*(\bar{s}) = J_t^{\tilde{\pi}}(s) \le J_t^*(s)$, by the optimality of $J_t^*(s)$.

Now to show the convergence result, couple the job completion times under the $s$ and $\bar{s}$-systems. Let $\sigma_k^* = \max_{(k,n) \in A} \sigma_k$ be the maximal service time for jobs in $A$. Consider a policy $\tilde{\pi}$ for the $\bar{s}$-system which idles for $\sigma_k^*$ time slots and begins processing new jobs at time $t' = t + \sigma_k^*$, but assumes that $t' = t$. Therefore, $\tilde{\pi}$ coincides precisely with $\pi^*(s)$ shifted in time by $\sigma_k^*$. In other words, the $\tilde{\pi}$ policy waits until $t'$, at which point all replica jobs are completed, and then begins processing new jobs as if no time has passed and $t' = t$.

For the $s$-system, let $T_j^*$ be the completion time of job $j$ under the optimal policy $\pi^*(s)$. Then $\tilde{T}_j = T_j^* + \sigma_k^*$ is the completion time for job $j$ under the $\tilde{\pi}$ policy. Now, given some $\epsilon > 0$,

$$\begin{aligned}
|J_t^*(s) - E[J_t^*(\bar{s})]| &\le |J_t^*(s) - E[J_t^{\tilde{\pi}}(\bar{s})]| \\
&= \Big| \sum_j E\big[ w_j(T_j^*) - E[w_j(T_j^* + \sigma_k^*)] \big] \Big| \\
&\le J\delta < \epsilon \qquad (26)
\end{aligned}$$

The inequality comes from Lemma 6 and, because $\delta \to 0$, there exists $\delta < \epsilon / J$. ■
Now, we are in position to prove that the performance of the greedy policy approaches the performance 
of the optimal policy when the decay of rewards is slow compared to the job completion time. 

Theorem 4 (Slowly Decaying Rewards) For any state $s \in \mathcal{S}$, as the time-scale of decay goes to 0, i.e. $\delta \to 0$, the performance of the greedy policy approaches that of the optimal policy:

$$J_t^g(s) \to J_t^*(s).$$

Proof: The proof is by induction on the number of jobs remaining to begin processing. Clearly, when only one job remains the greedy and optimal policies coincide. Now we assume it is true for $K - 1$ jobs remaining and show that it is true for $K$ jobs.

Denote by $\mathcal{J}^*$ and $\mathcal{J}_g$ the sets of jobs processed by the optimal and greedy policies in state $s$. Recall that, by Lemma 2, $|\mathcal{J}^*| = |\mathcal{J}_g|$. Let $A^*$ and $A_g$ denote the optimal and greedy scheduling actions, respectively. As before, we define $S(s, A)$, which is the next state given we start in state $s$ and take action $A$, and $\bar{S}(s, A)$, which is the state with machines in $A$ occupied by replica jobs which generate no reward.

Suppose we are given $\epsilon > 0$. Define $\delta_{\epsilon,1}$ such that for all $\delta < \delta_{\epsilon,1}$, $|J_t^*(s \mid \sigma_j) - J_t^*(\bar{S}(s, A) \mid \sigma_j)| < \epsilon/2$; this is possible due to Lemma 7. Define $\delta_{\epsilon,2}$ such that for all $\delta < \delta_{\epsilon,2}$, $|J_t^*(s') - J_t^g(s')| < \epsilon/2$ for any $s'$ with $K - 1$ jobs remaining; this is possible due to our inductive hypothesis. Then let $\delta_\epsilon = \min\{\delta_{\epsilon,1}, \delta_{\epsilon,2}\}$. For any $\delta < \delta_\epsilon$:

$$\begin{aligned}
J_t^*(s) &\le E[J_t^*(\bar{S}(s, A_g))] + \epsilon/2 \\
&\le \sum_{j \in \mathcal{J}_g} E[w_j(t + \sigma_j)] + E[J_t^*(S(s, A_g))] + \epsilon/2 \\
&\le \sum_{j \in \mathcal{J}_g} E[w_j(t + \sigma_j)] + E[J_t^g(S(s, A_g))] + \epsilon \\
&= J_t^g(s) + \epsilon
\end{aligned}$$

The first inequality is due to Lemma 7 for state $s$ and action given by $A_g$. The second inequality is by Lemma 4 for removing the greedy jobs. The third inequality is by the inductive hypothesis.

By the optimality of $J_t^*$, $|J_t^*(s) - J_t^g(s)| = J_t^*(s) - J_t^g(s)$. So for $\delta < \delta_\epsilon$, $|J_t^*(s) - J_t^g(s)| < \epsilon$, which proves our claim. ■

This result is intuitive because as the time-scale of decay becomes negligible to the time-scale of job 
completion times, rewards can be viewed as essentially constant. As such, it does not matter which order 
jobs are completed, since all will be completed. Hence, any policy, and certainly the greedy policy, is nearly 
optimal. However, the convergence rate to optimality will vary across policies. 
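The convergence in Theorem 4 can be illustrated on a small deterministic instance. The sketch below is a hypothetical variation of Example 2, not taken from the paper: job 1 earns 0.9 if completed in slot 1 and then loses `beta` per slot, job 2 always earns 1, and both service times equal 1. For this instance $\delta = \beta$ (for $\beta \le 0.9$), so shrinking `beta` corresponds to slowing the decay.

```python
from itertools import permutations


def schedule_reward(order, service, rewards):
    """Total reward on one machine when jobs run back to back from t = 0."""
    t, total = 0, 0.0
    for j in order:
        t += service[j]
        total += rewards[j](t)
    return total


def greedy_reward(service, rewards):
    """Reward of the policy that always picks the highest reward rate."""
    t, total, remaining = 0, 0.0, set(range(len(service)))
    while remaining:
        g = max(sorted(remaining),
                key=lambda j: rewards[j](t + service[j]) / service[j])
        remaining.remove(g)
        t += service[g]
        total += rewards[g](t)
    return total


def optimal_reward(service, rewards):
    """Best achievable reward by exhaustive search over job orders."""
    jobs = range(len(service))
    return max(schedule_reward(o, service, rewards) for o in permutations(jobs))


def ratio_for_slope(beta):
    """Optimal-to-greedy reward ratio for the two-job instance above."""
    service = [1, 1]
    rewards = [lambda t: max(0.0, 0.9 - beta * max(0, t - 1)),
               lambda t: 1.0]
    return optimal_reward(service, rewards) / greedy_reward(service, rewards)


ratios = [ratio_for_slope(b) for b in (0.9, 0.09, 0.009)]
print(ratios)
```

In this instance the greedy policy always runs job 2 first, so it loses exactly `beta` relative to the optimal order, and the ratio $1.9 / (1.9 - \beta)$ tends to 1 as `beta` shrinks.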



5 Performance Evaluation 

In the previous sections, we have shown performance guarantees for a greedy policy when scheduling jobs with decaying rewards. In light of Example 1 and Theorem 2, the loss in performance due to use of the greedy policy can be at least $\Delta + 1$ but can do no worse than $\Delta + 2$. In this section, we show that, in practice, the greedy performance is likely to be much better.

In order to enable computation of an optimal policy we assume that the number of jobs is finite and small (2-10). Even with a finite number of jobs, $|\mathcal{S}|$ grows exponentially fast in several problem parameters, which forces us to limit the size of the problem instances we consider. In particular, we consider problems with a single machine, $M = 1$, and jobs with finite deadlines less than 100. That is, no reward is accrued after $t = 100$. We assume job completion times are Geometric with $p_j$ evenly distributed between $p_{\min}$ and $p_{\max} = .9$. Since there is no closed form distribution for $\sigma_{\max}$, see Appendix A for how to find an upper-bound to $E[\sigma_{\max}]$ and, subsequently, an upper-bound to $\Delta$. We consider a number of decaying reward functions depicted in Fig. 6. The constants defining each reward function are drawn uniformly; all experimental results are averaged over 100 different realizations of these constants, with 1000 experiments for each such set.

Figure 6: Different types of decaying functions.
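An experiment of this kind can be sketched as a small Monte Carlo simulation of the greedy policy on a single machine. The code below is only an illustration under stated assumptions: Geometric($p_j$) service times on $\{1, 2, \ldots\}$, reward functions that vanish after `deadline`, and "evenly distributed" interpreted as evenly spaced values of $p_j$. The names `evenly_spaced`, `geometric`, `expected_reward`, and `simulate_greedy` are hypothetical, and no attempt is made to reproduce the paper's numbers.

```python
import random


def evenly_spaced(p_min, p_max, count):
    """count success probabilities evenly spaced on [p_min, p_max]."""
    if count == 1:
        return [p_min]
    step = (p_max - p_min) / (count - 1)
    return [p_min + i * step for i in range(count)]


def geometric(p, rng):
    """Sample sigma ~ Geometric(p) on {1, 2, ...}."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def expected_reward(w, p, t, deadline):
    """E[w(t + sigma)], assuming w is zero after `deadline`."""
    return sum(p * (1 - p) ** (k - 1) * w(t + k)
               for k in range(1, max(deadline - t, 0) + 1))


def simulate_greedy(rewards, probs, deadline, rng):
    """One sample path of the greedy policy on a single machine."""
    t, total, remaining = 0, 0.0, set(range(len(probs)))
    while remaining and t < deadline:
        # reward rate = E[w_j(t + sigma_j)] / E[sigma_j], with E[sigma_j] = 1/p_j
        g = max(sorted(remaining),
                key=lambda j: expected_reward(rewards[j], probs[j], t, deadline) * probs[j])
        remaining.remove(g)
        t += geometric(probs[g], rng)
        total += rewards[g](t)
    return total


rng = random.Random(0)
probs = evenly_spaced(0.1, 0.9, 5)
step = lambda t: 1.0 if t <= 100 else 0.0   # step reward with deadline 100
runs = [simulate_greedy([step] * 5, probs, 100, rng) for _ in range(1000)]
print(sum(runs) / len(runs))
```

Comparing against the optimal policy would additionally require solving the dynamic program over the state space, which this sketch does not attempt.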

In Table 1 we summarize the performance of the greedy policy for the reward functions shown in Fig. 6. In this case, $p_{\min} = .1$ and $p_{\max} = .9$; $\min_j E[\sigma_j] = \frac{1}{.9} = 1.11$ and $E[\max_j \sigma_j] \le 27.3$, therefore, $2 + \Delta \le 26.6$. We can see that while the optimal policy achieves larger reward than the greedy policy, the gains are within a factor of 1.20, much better than the guarantee provided by Theorem 2. Because we have finite deadlines for each job, there exists some $T_{\max}$ such that for all $j$, $w_j(t) = 0$ for all $t > T_{\max}$. Therefore, the reward function with exponential decay is slightly modified from the standard notion of exponential decay where $w_j(t) \to 0$, but $w_j(t) > 0$ for any $t < \infty$. Hence, the greedy policy is not optimal for this exponential decay with finite deadline.

It is interesting to note that the performance of the greedy policy seems to degrade as the number of jobs 
increases. We examine this more closely in Fig. |7]under a step function reward function where rewards are 
constant until a fixed deadline as in Fig. [6^. Clearly, the greedy and optimal policies coincide when there is 
only one job. As the number of jobs increases, the performance of the greedy policy degrades; however, the 
loss in performance is much less than the bound of 2 + A < 26.6 guarantees. 2 + A is a worse-case bound 
                 Reward ratio (optimal / greedy)
Type             J = 2      J = 5      J = 8
Step             1.0065     1.0931     1.1287
Linear           1.0133     1.0576     1.1289
Exponential      1.0609     1.0433     1.0590
Parabolic        1.0265     1.0382     1.0667
2-step           1.0218     1.1007     1.1520

Table 1: Performance of Greedy policy versus Optimal Policy for different types of decaying reward 
functions for J jobs. 

Figure 7: Performance loss (%) as the number of jobs (J) increases. 



and, while there are degenerate cases whose performance approaches this bound, it seems that in practice 
the performance of the greedy policy is likely to be much better. 

From Theorem 2, the performance of the greedy policy is dependent on Δ, the ratio between the largest 
and smallest expected service times. In our previous experiments, we have seen that the ratio of optimal 
to greedy reward is much less than 2 + Δ. We now examine if the performance of the greedy policy will vary 
significantly as we change Δ. We fix p_max = .9 and vary p_min ∈ [.01, .8]; this varies the upper-bound of Δ 
(as calculated in Appendix A), Δ_UB ∈ [5.2, 265.9]. In Fig. 8 we see how the performance of the greedy 
policy (%) varies with Δ. As expected, as Δ increases, so does the loss in performance. However, it is 
interesting to note that Δ must be very large before the degradation in performance is significant. In fact, 
for a large range of Δ_UB ∈ [1, 100], this ratio is nearly constant and the greedy policy performs within 
10% of optimal. Even when Δ = 260, the ratio is less than 1.3. 
Figure 8: Performance loss (%) as Δ_UB increases. 



A loss of 30% is much better than the theory guarantees. 

Depending on the system parameters, Δ can be arbitrarily large, which would lead to arbitrarily large 
degradation in performance of the greedy policy. While we have seen via an earlier example that the performance 
of the greedy policy can be highly dependent on Δ, we suspect this to be a degenerate example. We expect 
that, in practice, the performance of the greedy policy will be closer to that of the optimal policy. 



6 Conclusion 

In this paper, we have studied online stochastic non-preemptive scheduling of jobs with decaying rewards. 
Arbitrary decaying reward functions enable this model to capture various distastes for delay which are more 
general than the standard exponential or linear decay as well as fixed (random or deterministic) deadlines. 
Using stochastic Dynamic Programming techniques, we are able to show that a greedy heuristic is guaranteed 
to be within a factor of Δ + 2 of optimal, where Δ = E[σ_max] / min_j E[σ_j] is the ratio of largest to 
shortest service times. While there exist degenerate scenarios where the performance loss of the proposed 
policy is near this worst-case bound, we expect the performance loss to be much smaller for many practical 
scenarios of interest. 

This is a first look at non-preemptive scheduling with arbitrary decaying rewards. Some questions that 
remain are how to account for job arrivals and processor-dependent service times. When there are job 
arrivals, due to the non-preemptive service discipline, it may be optimal for a machine to idle in order to 
allow the machine to be free upon arrival of the new job. However, doing so requires some estimate or 
knowledge of future job arrivals, which may not be available. With processor-dependent service times, optimal 
policies may also call for idling. Consider a scenario where one machine is much faster than the rest. Then 
an optimal policy may process all jobs on this fast machine, causing the other machines to idle. Allowing for 
idling policies significantly complicates the optimization problem at hand. One option is to only consider 
non-idling policies and maximize reward over this class of policies. It can be shown via a highly degenerate 
example that requiring non-idling service disciplines can significantly degrade performance. However, for 
many scenarios this constraint is very natural. For instance, in service applications, such as health-care 
facilities, making customers (patients) wait when there are available servers (doctors) is unlikely to be tolerated. 

These are just some extensions to the general model we have analyzed. In this paper, we have considered 
the performance of an online scheduling algorithm for jobs with arbitrary decaying rewards. We have shown 
a worst-case performance bound for this policy compared to the optimal off-line algorithm. While there are 
some rare instances when the loss in performance of the proposed greedy policy is significant, in practice 
we expect the performance loss to be small. This, along with the simplicity of this algorithm, makes it highly 
desirable for real-world implementation. 
A Bound on σ_max 

Suppose the service time of job j is Geometrically distributed with probability p_j. Furthermore, p_j is 
uniformly distributed between [p_min, p_max]. 

Using the fact that σ_j is Geometrically distributed, i.e. P(σ_j ≤ x) = 1 − (1 − p_j)^x, gives: 

$$ P(\sigma_{\max} > x) = 1 - P(\sigma_{\max} \le x) = 1 - \prod_{j=1}^{J} \left[ 1 - (1 - p_j)^x \right] \qquad (27) $$



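Finding the expectation of σ_max gives: 

$$ E[\sigma_{\max}] = \sum_{x=0}^{\infty} P(\sigma_{\max} > x) = \sum_{x=0}^{\infty} \left[ 1 - \prod_{j=1}^{J} \left( 1 - (1 - p_j)^x \right) \right] \qquad (28) $$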
We can numerically solve (28) to get an upper-bound on E[σ_max] and, hence, an upper-bound on Δ. In 
particular: 

$$ \Delta \le \Delta_{UB} = p_{\max} \sum_{x=0}^{\infty} \left[ 1 - \left( 1 - (1 - p_{\min})^x \right)^J \right] $$
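The upper bound on Δ above can be evaluated by truncating the infinite sum once its terms become negligible. The sketch below is one way to do this, under the assumption that service times are independent; the function name `delta_upper_bound` and the tolerance `tol` are hypothetical choices, not part of the original analysis.

```python
import math


def delta_upper_bound(p_min, p_max, J, tol=1e-12):
    """Truncated evaluation of
    p_max * sum_{x >= 0} [1 - (1 - (1 - p_min)**x)**J].
    """
    if not (0.0 < p_min <= p_max <= 1.0) or J < 1:
        raise ValueError("need 0 < p_min <= p_max <= 1 and J >= 1")
    q = 1.0 - p_min
    total = 1.0  # the x = 0 term equals 1 for any J >= 1
    x = 1
    while True:
        tail = q ** x  # P(a single Geometric(p_min) service time exceeds x)
        # 1 - (1 - tail)**J, written with log1p/expm1 to avoid cancellation
        # when tail is tiny.
        term = -math.expm1(J * math.log1p(-tail))
        total += term
        if term < tol:
            break
        x += 1
    return p_max * total


if __name__ == "__main__":
    # Hypothetical parameter choice; J here is illustrative only.
    print(delta_upper_bound(0.1, 0.9, 8))
```

As a sanity check, for J = 1 the sum is a geometric series equal to 1/p_min, so the bound reduces to p_max/p_min, the exact ratio of the largest to the smallest expected service time.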